1. INTRODUCTION

Stop words are the words that occur with the highest frequencies in a document yet carry no significant
information [1]. They are characterized by having common relations within a cluster [2]. The presence of
stop words has an insignificant effect on the overall semantics of a sentence; they are mostly used to satisfy
the grammatical rules of the language [3]. They have also been described as noise that is evenly distributed
over a document [4]. Removing stop words reduces the total size of the documents in bytes and therefore
shortens the processing time of most information retrieval (IR) applications, such as automatic text
summarization, question answering and recommendation systems. Stop-word removal is described as a way of
improving the performance of information retrieval in general [5]-[7], and it improves the performance of
applications such as search engines [8], text classification [9], keyphrase detection [10], automatic
detection of grammatical errors [11], computation of semantic similarity [12], identification of sequence
patterns [13], e-mail spam detection [14], detection and removal of unwanted videos [15], hate speech
detection [16] and named entity identification [17]. Leaving stop words in place affects the automatic
selection of keywords or important phrases from a document [18], [19]. Stop-word removal is considered the
most vital preprocessing activity in information retrieval and artificial intelligence research [20], [21].

Stop words fall into two categories: grammar-specific stop words and domain-specific stop words.
Grammar-specific stop words include the language's pronouns, prepositions, conjunctions, adjectives, adverbs
and prefixes [20]. Domain-specific stop words are specific to a particular domain of information. Stop-word
lists are generated using diverse approaches, including dictionary-based, machine learning, word frequency,
entropy-based, statistical and part-of-speech (PoS) approaches. The simplest approach is the frequency
method [5], which computes the frequency of each word and takes the words with the highest frequencies in the
corpus as the stop words. The statistical method determines the average probability and the variance of each
word; the words with the highest probabilities and the lowest variances are considered the stop words. The
information theory model considers the information weight of a word: it computes the entropy of each word,
and the words with the lowest entropy make up the list.

This research generated a list of stop words for the Hausa language. Hausa is a Chadic language, widely
spoken in West Africa by about one hundred and fifty million (150,000,000) people as either a first or a
second language. It is the most widely spoken indigenous language in West Africa. Its native speakers are
spread across southern Niger, northern Nigeria, Ghana and northern Cameroon. It is also used for trade in
other places such as Equatorial Africa, Chad and Sudan. The Hausa are the largest ethnic group in West and
north-central Africa. A significant number of Hausa speakers are also found in Saudi Arabia, the Republic of
Benin, Ivory Coast and Togo. Many international media organizations, including the British Broadcasting
Corporation (BBC), Voice of America (VOA), Radio France Internationale (RFI) and China Radio International
(CRI), broadcast a range of programs in the language. There is a large body of literature on religion and
tradition written in the language, which may be of interest to many readers across the globe. The rest of the
paper is organized as follows. The remainder of this section reviews the related work. Section 2 presents the
method of the research. Section 3 presents the results of the experiments and the discussion. Section 4
presents the conclusion.

A comprehensive list of stop words was developed for the English language long ago. More recently,
various studies have proposed stop-word lists for other languages such as Hindi [3], Malay [22],
Arabic [21], [23], [24], Thai [25], Gujarati [26] and Urdu [27]. Similarly, Girmaw and Khedkar [20]
generated a stop-word list for the Amharic language using an aggregate technique that combines the word
frequency and entropy methods. Raulji and Saini [28] generated a stop-word list for the Sanskrit language
using a hybrid method, in which an automatic algorithm is combined with some involvement of human experts.
Asubiaro [29] generated a stop-word list for the Yoruba language using an entropy-based approach. Another
list was generated for the Yoruba language using an aggregate method that combines the frequency and word
entropy techniques [4].

A stop-word list was automatically generated for the Egyptian dialect using the frequency method [30].
An aggregate method combining statistical and similarity function approaches was used to generate a
stop-word list for the Persian language [31]. A deterministic finite automaton was used to generate stop
words for Hindi text [32]. Moreover, a machine learning algorithm was used for the automatic generation of
Bengali stop words [33]. Similarly, Sadeghi and Vegas [18] automatically generated a list of light stop words
for Persian text using an aggregate approach that combines the frequency, statistical and entropy methods.
Some studies focus on domain-specific stop words: Na and Xu [5] created a stop-word list for Chinese patents
using both frequency and statistical approaches, and a stop-word list was also generated for technical
language for use in engineering and related fields of knowledge [34].


2. METHOD 

The research methods and the dataset used in the research are described in the following subsections.
The research used the frequency and statistical approaches to generate the Hausa stop words. The dataset was
collected primarily from various Hausa news websites; the final corpus comprises 841 Hausa news articles.


2.1. Dataset 

A Hausa corpus was collected primarily for the experiments. Corpus creation is nowadays a challenging
task because of the way people write. Many people write text on the internet in non-standard styles, with
many abbreviations and mixed languages. To minimize this, the corpus was drawn only from standard Hausa news
websites. It comprises 841 Hausa news articles from BBC Hausa, VOA Hausa, RFI Hausa, the Aminiyya newspaper
and the Hausa Leadership newspaper. The text files were converted to UTF-8 format for Python compatibility.
The dataset is described in Table 1.


Table 1. Description of dataset
Corpora     Number of documents     Total words
Corpus 1    130                     40869
Corpus 2    323                     105436
Corpus 3    477                     146601
Corpus 4    602                     166813
Corpus 5    841                     187143
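
The UTF-8 conversion and the per-corpus counts in Table 1 could be reproduced along the following lines. This is only a minimal sketch: the function names `to_utf8` and `corpus_summary` are hypothetical, the original encoding of the downloaded files is not stated in the paper and must be supplied by the caller, and counting words by splitting on whitespace is an assumption about how the totals were obtained.

```python
from pathlib import Path

def to_utf8(src_path, dst_path, source_encoding):
    """Re-encode one text file as UTF-8 and return its word count.

    source_encoding is an assumption supplied by the caller; the paper
    does not say which encoding the downloaded files used.
    """
    text = Path(src_path).read_text(encoding=source_encoding)
    Path(dst_path).write_text(text, encoding="utf-8")
    # Words are counted by whitespace splitting (an assumption).
    return len(text.split())

def corpus_summary(paths):
    """Return (number of documents, total words) for a list of UTF-8 files."""
    total_words = sum(
        len(Path(p).read_text(encoding="utf-8").split()) for p in paths
    )
    return len(paths), total_words
```

Calling `corpus_summary` on the list of files that make up one corpus gives the two figures reported per row of Table 1.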

2.2. Frequency method

The frequency method was used to create the Hausa stop words in the research. The term frequency of a
word is the number of its occurrences in a given corpus, normalized by the total number of words in that
corpus. Mathematically, the term frequency of a word is determined as (1):

$$tf_{t,c} = \frac{f_{t,c}}{\sum_{t'} f_{t',c}} \qquad (1)$$

where $f_{t,c}$ is the number of occurrences of term $t$ in corpus $c$ and $\sum_{t'} f_{t',c}$ is the total
number of terms in the corpus. The stop words are generated using the frequency method as follows:

- Perform sentence- and word-level tokenization
- Generate the word frequencies in the corpus
- Sort the words by their frequencies
- Select the top-ranked words
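
The four steps above can be sketched in a few lines of Python. This is an illustrative sketch rather than the code used in the research: the helper names `tokenize` and `top_frequency_words` are hypothetical, the regular expression is a simple stand-in for the tokenization step, and the two sentences are toy text, not taken from the corpus.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase word tokens.

    A simple regex stand-in (assumption) for sentence- and word-level
    tokenization: it keeps runs of Unicode letters and drops the rest.
    """
    return re.findall(r"[^\W\d_]+", text.lower())

def top_frequency_words(documents, k):
    """Return the k most frequent words as (word, count, term frequency)."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    total = sum(counts.values())  # total number of word tokens in the corpus
    return [(word, count, count / total) for word, count in counts.most_common(k)]

# Toy documents for illustration only.
docs = [
    "Ya ce da su za su tafi.",
    "Ta ce da shi ya zo da wuri.",
]
for word, count, tf in top_frequency_words(docs, 3):
    print(word, count, round(tf, 3))
```

The third element of each tuple is the count divided by the total number of tokens, so the ranking by raw count and the ranking by term frequency are the same within one corpus.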


2.3. Statistics

The Hausa stop words are further generated in the research using the statistical method, in the following
steps:
- Perform sentence- and word-level tokenization
- Calculate each word's SAT value in the corpus
- Sort the words by their SAT in descending order
- Extract the words with high SAT as candidates, and filter them manually
Suppose the corpus is $D = \{d_i\}$, $1 \le i \le N$, where $N$ is the number of documents. The set of words
in the corpus is denoted as $W = \{w_j\}$. The average probability MP of word $w_j$ in $D$ is:


$$MP(w_j) = \frac{\sum_{1 \le i \le N} p_{ij}}{N} \qquad (2)$$


where $p_{ij}$ is the frequency probability of $w_j$ in $d_i$. In other words, $p_{ij}$ equals the frequency
of $w_j$ in $d_i$ divided by the number of words in $d_i$. A high MP value implies that the word occurs
frequently in the whole corpus. The variance VP of $w_j$ in $D$ is:


$$VP(w_j) = \frac{\sum_{1 \le i \le N} \left(p_{ij} - MP(w_j)\right)^2}{N} \qquad (3)$$


A low VP value implies that the word occurs uniformly in the whole corpus. The SAT of $w_j$ in $D$ is:


$$SAT(w_j) = \frac{MP(w_j)}{VP(w_j)} \qquad (4)$$

A high SAT value implies that the word occurs both frequently and uniformly in the whole corpus; such a
word is very likely to be a stop word. The words that appear in both the frequency-based list and the
statistical list are taken as the final stop words, as illustrated in Figure 1.
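
The statistical scores can be sketched as follows, reading MP as the mean of the per-document probabilities, VP as their variance, and SAT as MP divided by VP, so that a frequent and evenly spread word scores high. The function name `sat_scores` and the toy documents are hypothetical, and the treatment of a zero variance (assigning an infinite SAT) is an assumption, since the text does not discuss that case.

```python
from collections import Counter

def sat_scores(tokenized_docs):
    """Return {word: (MP, VP, SAT)} for a list of tokenized documents."""
    n_docs = len(tokenized_docs)
    # probs[i][w] = frequency of w in document i divided by its length
    probs = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        probs.append({w: c / len(tokens) for w, c in counts.items()})
    vocabulary = set().union(*probs)
    scores = {}
    for word in vocabulary:
        p = [doc.get(word, 0.0) for doc in probs]  # 0 where the word is absent
        mp = sum(p) / n_docs
        vp = sum((x - mp) ** 2 for x in p) / n_docs
        # Assumption: zero variance means perfectly uniform, so SAT is infinite.
        sat = mp / vp if vp > 0 else float("inf")
        scores[word] = (mp, vp, sat)
    return scores

# Toy documents for illustration only.
toy_docs = [
    ["da", "ya", "da", "zo"],
    ["da", "ta", "da", "ce"],
    ["da", "ya", "ya", "ya"],
]
scores = sat_scores(toy_docs)
ranked = sorted(scores, key=lambda w: scores[w][2], reverse=True)
print(ranked[0], scores[ranked[0]])
```

In this toy corpus "da" occurs in every document at a similar rate, so it receives a much higher SAT than "ya", which is concentrated in one document.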


[Flowchart boxes: Hausa corpus from news portal; Calculate term frequency; Calculate SAT; Select the high
rank words; Select words; Make the final list]

Figure 1. Architecture of the proposed work
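
The last stage of the architecture, taking the words common to both ranked lists, could look like the sketch below. The function name `final_stop_words`, the cut-off `k` and the two toy rankings are hypothetical; the manual filtering of candidates mentioned in the statistical method is not modelled.

```python
def final_stop_words(frequency_ranked, sat_ranked, k):
    """Keep the words found in the top k of both ranked lists.

    The result preserves the order of the frequency-based list.
    """
    top_sat = set(sat_ranked[:k])
    return [word for word in frequency_ranked[:k] if word in top_sat]

# Toy rankings for illustration only, not the full lists from the experiments.
by_frequency = ["da", "ya", "ta", "na", "dan"]
by_sat = ["da", "ta", "ya", "kuma", "na"]
print(final_stop_words(by_frequency, by_sat, 4))
```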

3. RESULTS AND DISCUSSION

Five corpora of different sizes were used for the experiments, with 40869, 105436, 146601, 166813 and
187143 total words respectively. The experiments with the frequency method produced five different stop-word
lists, as shown in Table 2.

The experiments were also conducted on the same dataset using the statistical method, and the results are
presented in Table 3. The results give the variance of each term for the different corpus sizes. The list is
made up of 100 Hausa words with their variance, of which the top 20 are shown.


Table 2. Top 20 highest frequency words using frequency method under different corpora
40869 words     105436 words    146601 words    166813 words    187143 words
Term: Count     Term: Count     Term: Count     Term: Count     Term: Count
da: 3258        da: 8431        da: 11164       da: 12835       da: 14427
ya: 1177        ya: 2826        ta: 4140        ta: 4740        ya: 5076
ta: 1028        ta: 2757        ya: 4047        ya: 4501        ta: 4987
na: 686         na: 1848        na: 2514        na: 2894        na: 3324
ba: 502         ba: 1296        ba: 1701        ba: 1807        ba: 1971
yi: 457         yi: 1161        kuma: 1595      kuma: 1757      kuma: 1848
su: 421         kuma: 1112      yi: 1426        yi: 1526        yi: 1828
ce: 421         ne: 1000        ne: 1311        ne: 1431        ne: 1508
kuma: 418       ce: 908         ce: 1101        su: 1193        ce: 1330
ne: 397         su: 879         su: 1078        ce: 1189        su: 1329
za: 317         ke: 743         za: 922         ke: 1035        ke: 1281
ke: 286         za: 711         ke: 914         daga: 1032      suka: 1144
daga: 282       daga: 647       daga: 875       suka: 1008      daga: 1076
suka: 239       shi: 634        dan: 861        za: 986         mai: 1043
shi: 231        mai: 626        mai: 833        mai: 918        za: 1006
wa: 228         suka: 614       cewa: 824       dan: 910        cewa: 987
sun: 222        kan: 593        suka: 798       cewa: 890       yan: 974
ga: 220         sun: 591        shi: 796        shi: 845        sun: 957
cikin: 217      wa: 586         aka: 738        sun: 841        kan: 932
yan: 207        aka: 576        kan: 725        kan: 840        aka: 931


Table 3. Top 20 words with the highest spread/distribution under different corpora
187143 words      166813 words      146601 words      105436 words      40869 words
Term: Variance    Term: Variance    Term: Variance    Term: Variance    Term: Variance
da: 29886.18      da: 25886.18      da: 23005.28      da: 22746.10      da: 20640.80
ya: 24776.76      ta: 21176.76      ya: 20036.76      ya: 19903.81      ya: 17005.60
ta: 23910.11      ya: 20995.11      ta: 18995.71      ta: 16500.20      ta: 14005.20
kuma: 14633.61    na: 13611.00      na: 13004.21      na: 12401.05      na: 10650.10
ba: 9062.23       yi: 8057.23       ba: 7800.30       ba: 7091.10       ba: 6005.11
yi: 7139.54       su: 7038.54       yi: 7001.08       yi: 6805.40       yi: 4005.60
su: 3842.31       ba: 3072.61       su: 2984.71       su: 2800.40       su: 2250.10
ce: 2509.41       ce: 2206.91       ce: 2109.10       ce: 2000.97       ce: 1807.50
na: 2013.87       kuma: 2001.01     kuma: 1903.11     kuma: 1730.20     kuma: 1540.10
ne: 1500.32       ne: 1302.12       ne: 1205.01       ne: 1140.46       ne: 1000.50
ke: 1200.89       za: 1150.75       za: 1075.00       za: 900.70        za: 700.70
za: 878.01        ke: 798.41        ke: 645.31        ke: 570.05        ke: 502.15
yan: 623.11       daga: 611.55      daga: 599.01      daga: 500.44      daga: 425.10
suka: 501.22      suka: 499.92      suka: 474.75      suka: 416.20      suka: 397.63
shi: 453.07       shi: 413.37       shi: 401.05       shi: 395.00       shi: 380.60
wa: 400.33        wa: 395.23        wa: 360.48        yan: 340.10       wa: 300.10
sun: 398.11       sun: 377.21       sun: 327.06       wa: 311.30        sun: 278.20
ga: 225.01        ga: 205.15        ga: 200.075       ga: 190.05        yi: 174.50
cikin: 214.11     cikin: 195.61     cikin: 180.65     cikin: 165.50     cikin: 159.71
daga: 200.22      yan: 160.52       yan: 156.20       sun: 135.70       ga: 120.40


The proportion of common words between the lists from adjacent corpus sizes was computed, and the list
reached saturation after the fourth experiment, as shown in Table 4. The table gives the proportion of common
words for different selection sizes: top 25, top 50 and top 75. The best scores were obtained for the largest
pair of corpora, with 166813 and 187143 words.

The accuracy of the proportion increases with the corpus size: the more words in the documents, the more
accurate the proportion. The accuracy of the list is also affected by the number of stop words in it: the
fewer the words, the better the accuracy of the list, as shown in Figure 2. Finally, a total of 81 words were
selected for the list.
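
The proportion of common words between two adjacent lists can be sketched as below. The definition used here (size of the intersection of the two top-k sets divided by k) is an assumption, since the exact formula is not given in the text, and the function name `common_proportion` is hypothetical. The two lists are the top 20 words of the 166813-word and 187143-word corpora from Table 2.

```python
def common_proportion(list_a, list_b, k):
    """Share of words that the top k of list_a and the top k of list_b have in common."""
    top_a, top_b = set(list_a[:k]), set(list_b[:k])
    return len(top_a & top_b) / k

# Top 20 words by frequency, copied from Table 2.
corpus_166813 = ["da", "ta", "ya", "na", "ba", "kuma", "yi", "ne", "su", "ce",
                 "ke", "daga", "suka", "za", "mai", "dan", "cewa", "shi", "sun", "kan"]
corpus_187143 = ["da", "ya", "ta", "na", "ba", "kuma", "yi", "ne", "ce", "su",
                 "ke", "suka", "daga", "mai", "za", "cewa", "yan", "sun", "kan", "aka"]

print(common_proportion(corpus_166813, corpus_187143, 10))
print(common_proportion(corpus_166813, corpus_187143, 20))
```

Under this definition the order of words inside the top k does not matter, only membership, which is why swapping "ya" and "ta" between the two lists does not lower the score.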

Table 4. Proportion of common words under corpora with different scales
Corpus size    40869-105436 words    105436-146601 words    146601-166813 words    166813-187143 words
Top 25         0.92                  0.96                   0.96                   1
Top 50         0.88                  0.94                   0.96                   0.96
Top 75         0.88                  0.88                   0.93                   0.99

Figure 2. A bar chart for proportion of common words under corpora with different scales


3.1. Final list

The final list, comprising 81 distinct Hausa stop words generated using the method described above, is
presented in Table 5. The list is created by selecting the words produced by both the frequency and the
statistical methods. It comprises the most common Hausa prepositions, conjunctions and pronouns.


Table 5. Final list of Hausa stop words
List of Hausa stop words
da      wannan    yake     suka        daga     idan     abin     cikin
ya      wa        suke     sun         mai      yayin    ana      shi
ta      wanda     hakan    wasu        za       babu     in       bayan
na      an        hada     kan         cewa     baya     ita      ga
ba      wani      akan     ma          yan      tare     akwai    kai
kuma    sai       aka      kamar       ko       yadda    sake     amma
yi      masu      bai      tun         inda     don      irin     sa
ne      domin     mu       wadanda     samu     ake      tana     wajen
ce      dai       ke       su          yanzu    zai      ciki     har


4. CONCLUSION


Removing stop words is crucial in natural language processing and in artificial intelligence research in
general. Because no Hausa stop-word list was available, this research filled the gap by creating a general
list of Hausa stop words using both the frequency and the statistical methods. The list is created by
selecting the words that appear under both methods. A total of 81 Hausa words were finally selected after
various considerations.